Abstract

Background: While some studies have shown that 3D protein structures are more conserved than their 
amino acid sequences, other experimental studies have shown that even if two proteins share the same topology, 
they may have different folding pathways. Many studies have investigated this issue with molecular dynamics 
or Go-like model simulations; however, one should be able to obtain the same information by analyzing the proteins' 
amino acid sequences, if the sequences contain all the information about the 3D structures. In this study, we use 
information about protein sequences to predict the location of their folding segments. We focus on proteins with a 
ferredoxin-like fold, which has a characteristic topology. Some of these proteins have different folding segments. 

Results: Despite the simplicity of our methods, we are able to correctly determine the experimentally identified folding 
segments by predicting the location of the compact regions considered to play an important role in structural 
formation. We also apply our sequence analyses to some homologues of each protein and confirm that there are 
highly conserved folding segments despite the homologues' sequence diversity. These homologues have similar 
folding segments even when the sequence similarity between two proteins is not high. 

Conclusion: Our analyses have proven useful for investigating the common or different folding features of the 
proteins studied. 

Keywords: Folding initiation segment prediction, Sequence analysis, Inter-residue average distance statistics, 
Evolutionarily conserved folding, Ribosomal protein S6, Procarboxypeptidase A2, U1A Spliceosomal protein, 
mt-Acylphosphatase 



Background 

Clarifying how a protein folds into its unique 3D structure 
is a very significant yet unsolved problem in molecular 
biophysics and bioinformatics [1]. In particular, some re- 
cent experimental studies have revealed that proteins 
sharing the same topology can take different folding path- 
ways [2-10]. 

Ferredoxin-like fold proteins are well-known proteins 
that fold via different folding pathways. Their topology is 
composed of 2 α helices and 4 β strands, and the secondary
structure arrangement seems similar to the β/α/β triad
motif in flavodoxin [11,12] or TIM-barrel proteins [13].
However, the connectivity of the secondary structures
differs. While flavodoxin or TIM-barrel proteins have a
parallel β sheet, ferredoxin-like proteins have an
anti-parallel β sheet. Therefore, it is hard to explain the
ferredoxin-like proteins' folding behavior only with the
formation of β/α/β triads and the interaction among
subdomains as in the case of flavodoxin, even if it is true
that most of the hydrophobic contacts are composed of
Ile, Leu and Val residues as reported in the literature [14].

Ferredoxin-like fold proteins are relatively small as 
shown in Figure 1, but they have interesting features in 
the structural transformation from denatured states to 
transition or native states. For example, one ferredoxin-like 
protein called ribosomal protein S6 contains two over- 
lapping foldons, which fold cooperatively, located at dif- 
ferent termini, and the overlapping makes this protein 
fold in a two-state manner as reported by Haglund et al. 
[15]. However, other proteins such as U1A spliceosomal 
protein or procarboxypeptidase A2 fold via the N- or C- 
terminal foldon as reported by Ternstrom et al. [16] or 
Villegas et al. [17], respectively.

Figure 1 Proteins treated in this study. (a) Tertiary structures of the study proteins obtained from the Protein Data Bank [18]. Their names are 
shown above the structures. α helices and β strands are represented with helices and arrows, respectively. These protein models were created with 
Visual Molecular Dynamics 1.9.1 [19]. (b) Amino acid sequences of the study proteins. The filled circles above them mark every tenth 
residue. Arrows and helices below the sequences indicate the location of secondary structures. 



If all the information related to the formation of 3D 
structures is encoded in the amino acid sequences, we 
should be able to decode these sequence differences to 
obtain their folding features. Still, this is a challenging 
task. How the folding mechanism of a protein is coded 
in the amino acid sequence information remains an im- 
portant issue to be clarified. 

There are some bioinformatics approaches for predict- 
ing some aspects of folding mechanisms, like folding rate, 
from the amino acid sequence [20-27]. Nevertheless, it is 
still difficult to extract more detailed information on the 
folding mechanism, that is, how each protein folds. 

The fact that the topologically equivalent proteins do 
not always fold via the same folding pathway leads us to 
the question of whether evolutionarily related proteins 
really have common folding properties. Evolutionarily
related proteins have in fact been observed to
fold via different folding pathways. For example, in spite
of the fact that PDZ2 and PDZ3, both of which contain 
mainly β sheets, are evolutionarily related and have more 
than 30% sequence identity, they do not share the same 
folding mechanism, at least in the early stage of folding 
[28]. On the other hand, fibronectin and titin, which are 
evolutionarily unrelated proteins but have the same top- 
ology, share the folding mechanism involving four key res- 
idues and their peripheral residues [29]. There are also 
some other studies focusing on the differences in the fold- 
ing of topologically equivalent proteins [3]; yet, these ex- 
periments were performed only for several proteins of 
each topology and were not applied to a whole family. 

In this study, we aim to decode such information for 
well-studied ferredoxin-like fold [30] proteins by analyzing 
their amino acid sequences, not only with the previously
mentioned bioinformatics approaches but also with our
own analyses. The methods we apply here are homologous 
sequence search, phylogenetic analysis and sequence- 
based analyses by means of inter-residue average distance 
statistics. 

The methods, which are based on the inter-residue 
average distance statistics [31,32] using the amino acid 
sequences as input, have so far provided valuable infor- 
mation on the initial folding segments that play crucial 
roles in the structural formation in the cases of lyso- 
zyme, leghemoglobin, fatty acid-binding protein, azurin, 
and two ancient TIM-barrel proteins [33-36]. We also 
apply our methods to some evolutionarily related pro- 
teins of four ferredoxin-like fold proteins to examine 
whether evolutionarily related proteins have common 
folding properties. 

Methods 

Proteins treated in this study 

The proteins treated in this study are U1A spliceosomal 
protein (U1A) [PDB: 1URN] [37], procarboxypeptidase A2
(ADA2h) [PDB: 1O6X] [38], ribosomal protein S6 (S6)
[PDB: 1RIS] [39], and muscle-type acylphosphatase (mtAcP)
[PDB: 1APS] [40] as shown in Figure 1. These were selected
through the Protein Folding Database 2.0 [41], which provides
experimental folding data on proteins [15-17,42]. We
call these proteins our "study proteins". The amino acid se- 
quences of these proteins were obtained from the structured 
region in their PDB files. Their sequence identities are quite 
low, ranging from 11 to 23%. Many experimental studies on 
these proteins have been performed with respect to the 
ferredoxin-like fold, and some of these studies suggest the 
existence of different folding segments [15-17,42]. 



Matsuoka and Kikuchi BMC Structural Biology 2014, 14:15 
http://www.biomedcentral.com/1472-6807/14/15






Inter-residue average distance statistics 

To prepare the statistics for our methods, we calculate 
the average distance and its standard deviation for each 
inter-residue pair in 42 various proteins, considering the 
amino acid types and the sequence separation. For the
sequence separation, we group the separations
k = 1-8, 9-20, 21-30, 31-40, ... into the
ranges M = 1, 2, 3, 4, ..., respectively. The 42 representative
proteins were carefully chosen so as not to be biased to- 
wards some specific structures and have been confirmed 
to extract the regions corresponding to the structural 
domains. Because our analysis results are strongly af- 
fected by the particular protein datasets used, we chose 
not to alter the datasets based on the analysis results and 
to instead use the same datasets as in the first paper on 
ADM (Ref. [31]) to allow for comparability. The 42 proteins
are listed in Additional file 1: Table S3.
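
The binning of sequence separations into ranges can be written down compactly. The following is a minimal sketch, not the authors' implementation; the function name `separation_range` is hypothetical, and the continuation of the bins beyond k = 40 in steps of 10 residues is an assumption based on the pattern listed in the text.

```python
def separation_range(k: int) -> int:
    """Map a sequence separation k = |i - j| (k >= 1) to its range index M."""
    if k < 1:
        raise ValueError("sequence separation must be at least 1")
    if k <= 8:
        return 1  # k = 1-8
    if k <= 20:
        return 2  # k = 9-20
    # Assumption: from k = 21 onward the bins continue in steps of 10
    # residues (21-30 -> 3, 31-40 -> 4, 41-50 -> 5, ...).
    return 3 + (k - 21) // 10
```

With such a mapping, the average distance and standard deviation for a residue pair would be looked up by the two amino acid types and by M rather than by the raw separation k.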

Average Distance Map analysis 

The regions predicted by Average Distance Map (ADM) 
analysis correspond to the regions that tend to be com- 
pact in their 3D structure. We believe that these com- 
pact regions might be structured in the early stage of 
folding. 

The ADM analysis itself is a method for predicting the 
location of possible structural units in a protein by ana- 
lyzing a predicted contact map [31] based on inter- 
residue average distance statistics. This map is referred 
to as an ADM and is used to extract standard structural 
units, such as structural domains or compact regions. In 
the ADM, any pair of residues with smaller average dis- 
tance is considered to be in contact more than the other 
pairs, so the segment with many such pairs is considered 
to form a structural unit like folding segments by mech- 
anisms such as hydrophobic collapse. These segments 
are automatically extracted by analyzing the ADM (as 
explained in the following text). 

In the current study, we use this method to extract 
compact regions (not structural domains). For each 
compact region, the strength as a predicted folding segment
is expressed as an η value. The η value tends to be
higher if the corresponding compact region has many
contacts within the region or with the other regions, including
non-local areas (for more details, see Refs.
[31,43] or [Additional file 1]). Contact density has been
reported to correctly identify the nucleating subdomains
in T4 lysozyme and interleukin-1β [12]. Studies on
flavodoxin-like proteins also suggest a relationship
between contact density and the folding rate of the corresponding
area by showing that low contact density
leads to structuring late in folding [44]. Thus, it would
be reasonable to consider the regions with many contacts
(a high η value) as the regions structured in the
early stage of folding.



Finally, all high-η-value units which do not overlap
with other high-η-value units are designated as predicted
folding segments, except for units that cover the whole
sequence. When a predicted folding segment covers 70
to 100% of the whole sequence length, we conduct an
additional search for folding segments overlapping in
this unit. Because we could find only two folding segments
for each protein in this study, we call the segment
with the higher η value the primary segment and
the other the auxiliary segment.

In Additional file 1: Figure S5, we compare the ADMs 
of the ferredoxin-like proteins with the actual contact 
map constructed based on the contacts defined later. 

Comparison of the regions predicted by two ADMs 

Suppose that a multiple alignment of homologous se- 
quences in a ferredoxin-like protein is obtained. Since it 
is convenient to define the similarity of location pre- 
dicted by any two ADMs in the multiple alignment, we 
define the similarity as follows: First, two sequences are 
chosen from the aligned sequences, as shown in Figure 2. 
Second, all sites with a gap in either one or both se- 
quences are removed. Here, "site" refers to the common 
sequential number in the multiple alignment. Finally, the 
number of sites that are commonly included in or ex- 
cluded from the regions predicted by the ADMs for two 
given sequences is calculated. The ratio of this number 
to the number of all the non-gapped sites is defined as 
the similarity of location predicted by the two ADMs. 

It should be noted that the similarity calculated by this
procedure does not take η values or gapped sites into
account; the present method is therefore not suitable in
cases where the sequences in an alignment show large
gaps. That said, the multiple alignments of the
sequences obtained in this study contain only a few small gaps.
For this reason, we can apply this definition of the similarity
to the present results.
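
The similarity defined in this section can be illustrated with a short sketch. The function name `adm_similarity` and the representation of a predicted region as a per-site list of booleans are assumptions made for illustration; only the counting rule itself comes from the text.

```python
def adm_similarity(seq_a, seq_b, in_region_a, in_region_b, gap="-"):
    """Fraction of non-gapped alignment sites on which two ADM predictions agree.

    seq_a, seq_b: two aligned sequences of equal length.
    in_region_a, in_region_b: per-site booleans, True when the site lies
    inside a compact region predicted by the ADM for that sequence.
    """
    if not (len(seq_a) == len(seq_b) == len(in_region_a) == len(in_region_b)):
        raise ValueError("all inputs must have the alignment length")
    agree = 0
    total = 0
    for a, b, ra, rb in zip(seq_a, seq_b, in_region_a, in_region_b):
        if a == gap or b == gap:
            continue  # sites with a gap in either sequence are removed
        total += 1
        if ra == rb:  # commonly included in, or commonly excluded from, a region
            agree += 1
    if total == 0:
        raise ValueError("no non-gapped sites to compare")
    return agree / total
```

For a pair with 23 non-gapped sites of which 21 agree, the function returns 21/23, or about 91.3%.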

F-value analysis 

We performed additional analyses to determine the loca- 
tion where initial folding events, such as hydrophobic 
collapse, happen [32]. Other statistical potentials,
such as those of Miyazawa-Jernigan [45] and Skolnick [46], may
return similar information, but they would make it difficult to
interpret the results together with the ADM. Because the
inter-residue average distance statistics contain not only
the average distance but also the standard deviation of
the distance, we
expect the potential used in our F-value analysis to
reproduce the dynamics of the denatured state ensemble
better than a potential based on contact
energy. In F-value analysis, we use the Cα bead model to
represent a protein's structure, as well as the Metropolis
Monte Carlo method with the potential energy derived
from the average distance $\bar{r}_{ij}$ and its standard deviation $\sigma_{ij}$.

Figure 2 Example of two sample aligned sequences from a multiple alignment. A region in gray corresponds to a compact region predicted 
by the ADM. Because there are 23 sites that have no gap in either sequence and there are 21 sites that are commonly included in or excluded 
from the predicted compact regions, 21/23 ≈ 91.3% is the ADM similarity for this example. 



The bond and dihedral angles of the initial conformation 
are randomly selected. The movement in the simulation is 
done as follows: The bond angle between residues i
and i + 1 is bent and rotated randomly by -10° to 10°,
followed by the Metropolis judgment to decide whether the new
conformation is acceptable. Within a step, this is performed
for i = 1 ... N-1, that is, all the bond angles are altered
and judged.
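
The move-and-judge cycle described above follows the standard Metropolis scheme. The sketch below is a simplified illustration under stated assumptions: the chain is reduced to one scalar angle per bond, the `energy` callable is a hypothetical stand-in for the potential used in the paper, and the names `metropolis_accept` and `sweep` are not taken from the source.

```python
import math
import random


def metropolis_accept(delta_e, kT, rng=random):
    """Standard Metropolis criterion for an energy change delta_e."""
    if delta_e <= 0:
        return True  # downhill or neutral moves are always accepted
    return rng.random() < math.exp(-delta_e / kT)


def sweep(angles, energy, kT, max_change=10.0, rng=random):
    """One simulation step: perturb every angle once and judge each move.

    angles: list of angles in degrees, modified in place.
    energy: hypothetical callable returning the energy of a conformation.
    Returns the fraction of accepted moves in this step.
    """
    current = energy(angles)
    accepted = 0
    for i in range(len(angles)):
        old = angles[i]
        angles[i] = old + rng.uniform(-max_change, max_change)
        trial = energy(angles)
        if metropolis_accept(trial - current, kT, rng):
            current = trial
            accepted += 1
        else:
            angles[i] = old  # rejected: restore the previous conformation
    return accepted / len(angles)
```

The fraction returned by `sweep` is the quantity one would monitor when tuning kT toward a target acceptance ratio.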

The probability density with the potential energy between
two residues, $P(\varepsilon_{ij})$, is hypothesized as being
equivalent to the probability density based on the standard
Gaussian distribution calculated with its average distance
and standard deviation, $p(\bar{r}_{ij}, \sigma_{ij})$, as follows:

$$P(\varepsilon_{ij}) = p(\bar{r}_{ij}, \sigma_{ij}) \qquad (1)$$

where this equation can be expressed by equation (2). Finally, we obtain
equations (3) and (4), where kT is set so that the acceptance ratio is 0.5. This
potential is designed to sample the ensembles which can 
reproduce the inter-residue average distance statistics. From 
the simulation, the contact frequency, g(i,j), for each pair of 
residues is calculated with sampled structures generated 
using the potential energy function. Then we normalize the 
residue contact frequencies, g(i,j), in the same range M as 
follows:

where μ or ν is the residue number. Finally, the relative
contact frequency, $F_i$, is obtained by summing the normalized
contact frequencies, Q(i,j), from j = 1 to N for
each residue i, where N is the protein sequence length:

$$F_i = \sum_{j=1}^{N} Q(i,j) \qquad (7)$$

The peaks in the plots of $F_i$, or F-value peaks, are
thought to be located in the center of many inter-residue 
contacts, such as a hydrophobic cluster. Therefore, the re- 
gions around the peaks are assumed to be important for 
folding, especially for its initial state. F-value analysis there- 
fore allows us to estimate the location where a folding initi- 
ation occurs, except for the termini with their expected 
extreme flexibility in the simulation: due to the flexibility, 
the $F_i$ values at the terminal residues become unrealistically
high and are therefore not considered meaningful [47].
We ran this simulation 100 times with 60,000 steps each and
averaged the $F_i$ value for each residue i.
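
The step from contact frequencies g(i,j) to relative contact frequencies can be sketched as follows. This is an illustration only: the normalization within each range M is assumed here to be a z-score (subtracting the mean and dividing by the standard deviation of all g values in the same range), the self-pair i = j is skipped, and the bin boundaries beyond k = 40 are assumed to continue in steps of 10. The function names are hypothetical.

```python
from statistics import mean, pstdev


def separation_range(k):
    """Range index M for a sequence separation k (bins beyond 40 assumed)."""
    if k <= 8:
        return 1
    if k <= 20:
        return 2
    return 3 + (k - 21) // 10


def f_values(g):
    """Relative contact frequency per residue from an N x N matrix g.

    g[i][j] is the contact frequency of residues i and j (0-based).
    Assumption: Q(i, j) is the z-score of g(i, j) among all pairs that
    fall in the same range M; pairs in a range with zero spread get Q = 0.
    """
    n = len(g)
    by_range = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                by_range.setdefault(separation_range(abs(i - j)), []).append(g[i][j])
    stats = {m: (mean(v), pstdev(v)) for m, v in by_range.items()}
    result = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue
            mu, sd = stats[separation_range(abs(i - j))]
            if sd > 0:
                total += (g[i][j] - mu) / sd
        result.append(total)
    return result
```

In this sketch a residue receives a high value when it takes part in pairs that are in contact more often than is typical for their sequence separation, which matches the reading of a peak as the center of many inter-residue contacts.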

Analysis of evolutionarily conserved residues 

Evolutionarily conserved residues maintain a protein's
function, contribute to its stability, or relate to its structural
formation [48-53]. Therefore, conserved residues
that have many contacts with other conserved residues
in the native structure are thought to be significant indicators
of potential folding segments. Based on this idea,
we gathered the homologous sequences for each study protein
with the Basic Local Alignment Search Tool (BLAST)
[54] (DB: UniRef100, Threshold: 0.01, Gapped: No) and
aligned them with the Multiple Sequence Comparison by
Log-Expectation tool (MUSCLE) [55]. We applied the
neighbor-joining method [56] to construct a phylogenetic
tree for each study protein for input into the Phylogenetic
Analysis by Maximum Likelihood software
package (PAML) [57]. With PAML, for each site without
any gap, we can count the number of residue substitutions 
by using JTT matrices [58] for the substitution matrix and 
a Poisson distribution for the substitution model. This
procedure allows us to estimate the conservation or substitution
of a specific residue during evolution based on
branch lengths and bifurcations in a phylogenetic tree. Because
only the conservation of hydrophobic residues is
taken into account in this study, the hydrophobic residues
with more than 99% conservation are regarded as conserved
residues. A residue is still regarded as conserved
when one of the hydrophobic residues A, M, W, L,
F, V, I, and Y has mutated to another one.
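
The idea that a hydrophobic-to-hydrophobic change still counts as conservation can be shown with a simplified stand-in. Note that the sketch below does not reproduce the PAML-based substitution counting used in the study; it swaps in a plain column-wise fraction over the aligned sequences, and the function name and the use of `-` as the gap symbol are assumptions.

```python
HYDROPHOBIC = frozenset("AMWLFVIY")  # the hydrophobic set named in the text


def conserved_hydrophobic_sites(alignment, threshold=0.99, gap="-"):
    """Return 0-based columns that are gap-free and hydrophobic in more than
    `threshold` of the sequences (simplified proxy, not the PAML procedure)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    sites = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        if gap in column:
            continue  # only sites without any gap are considered
        fraction = sum(res in HYDROPHOBIC for res in column) / len(column)
        if fraction > threshold:
            sites.append(col)
    return sites
```

A column containing L, I and V is reported as conserved here even though the residue identity changes, because every member belongs to the hydrophobic set.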

We employed BLAST to identify potential homologous
sequences. Only sequences that cover the whole
sequence of a study protein were selected as homologous
sequences based on the BLAST results. The
BLAST search identified at least 100 homologous sequences
for each study protein (see Additional
file 1: Table S1).

Definition of inter-residue contacts 

The Shrake-Rupley algorithm [59] was used to define a 
contact by the decrease in the Solvent Accessible Surface 
Area (SASA) upon folding. The reduction in the surface 
area is calculated as the difference between a side chain's
SASA in the presence of contacts with other residues and
that in the absence of contacts. In this study, only heavy
atoms are considered, and when the decrease in the SASA
reaches 27 Å², the corresponding hydrophobic residue
pair is defined as being in contact. This threshold was determined
from the decrease in the SASA when two carbon
atoms form a contact, namely, 27.27 Å².
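
Once the two SASA values are available, the contact criterion reduces to a threshold test. The sketch below assumes the SASA values (in Å²) have already been computed, for example by a Shrake-Rupley implementation, and uses hypothetical function and parameter names.

```python
CONTACT_THRESHOLD = 27.0  # Å^2, the threshold stated in the text


def sasa_decrease(sasa_without_partner, sasa_with_partner):
    """Decrease in side-chain SASA caused by the presence of the partner residue."""
    return sasa_without_partner - sasa_with_partner


def in_contact(sasa_without_partner, sasa_with_partner, threshold=CONTACT_THRESHOLD):
    """True when the SASA decrease reaches the contact threshold."""
    return sasa_decrease(sasa_without_partner, sasa_with_partner) >= threshold
```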



Summary of the experimental results from the literature 

Figure 3 provides a summary of the results reported in
the literature from various Φ-value studies, which provide
information on structured sites in the transition
state [60]. We compare the regions predicted by ADM
with the location of secondary structures with high Φ
values. Averaging the Φ values for each secondary structure
is a good way to understand the differences in folding
mechanisms among a set of proteins with the same
topology, and this method has been used by many
researchers (for instance, TNfn3/CD2d1 [61], mt/sso-AcP
[62], mt-AcP/ADA2h [17,42], wt-S6/permutants [15], and
so on). We validated our predictions by comparing them
to experimental results interpreted in the same manner.
(We need to note that there are a few residues with high
Φ values which should be excluded. For example, P54A in
mtAcP has a high Φ value of 0.98. However, according to
the 3D structure, its side chain seems exposed; thus, its
high Φ value seems to be derived from the unique dihedral
angles of the proline residue, and we did not treat it
as a member of the folding segment.)

In this study, a few secondary structures with relatively
high Φ values are defined as an experimental folding
segment. Even though the resolution is somewhat lower
because the folding segment is defined by average Φ
values, this approach is still similar in concept to the
"folding nucleus" first introduced by Shakhnovich and his
colleagues [63] as a set of contacts in denatured states that
are considered sufficient and necessary for transitioning or
molten globule states to occur, as observed by a computational
technique. The formation of these contacts is the rate-limiting
step and should be completed by the transition state.

Figure 3 Experimental Φ values and average values for each secondary structure. (a) U1A, (b) ADA2h, (c) S6, and (d) mtAcP. Dots denote 
the experimental Φ values. Gray bars indicate the average Φ value for each secondary structure. Because no Φ values in the 3rd β strand of U1A 
have been reported, its average value is not shown. 

Some later studies support the idea of a folding nucleus
by means of Φ-value analysis or by combining experimental
Φ values with computational techniques [64,65].
In other words, the folding segment is thought to be
relatively structured compared to other regions from the
denatured state to the transition state. (This is because
high-Φ-value sites correspond to the sites which are energetically
stable in the transition state, and we expect that
such an energetically stable region forms even in the
early stage of folding.) For example, for the wild-type
ribosomal protein S6 from Thermus thermophilus, one
of the ferredoxin-like fold proteins studied by Haglund et al.
[15], one folding nucleus is reported to consist of β1, α1,
and β3 (despite α1 and α2 having very similar average
Φ values). However, for the circular permutants prepared
by connecting the N- and C-termini and disconnecting other
loops between neighboring secondary structures, the
folding nucleus sometimes noticeably shifts to β1, α2, and
β4 [15]. Haglund et al. [15] therefore found that in S6
there are two competing and overlapping folding nuclei,
and that their relative significance for folding can
be perturbed by circular permutation. (In Additional file 1:
Figures S3 and S4, we also summarize the results of the
ADM and F-value analyses for the circular permutant of
S6 whose X-ray crystallographic structure is available. The
availability of this structure guarantees that the native structure is not
disordered or structured in other conformations, which
makes its Φ values seem reliable. We obtained results similar
to those of the previous analyses.) Ternstrom et al. [16] also
performed an experimental study on U1A spliceosomal
protein, which also folds through the formation of α1 and its
surrounding secondary structures in an early stage. However,
Villegas et al. [17] reported that procarboxypeptidase
A2, which has the same topology as S6 and U1A, folds
through the formation of α2 and part of β1/β3. Its folding
segments seem to be in the C-terminus (β1, α2, β3) like
the S6 permutants (β1, α2, β4 in the S6 wild type), while
there are some differences around the β strands. Specifically,
there seem to be two tendencies with respect to the
location of the folding segments, the N-terminus with α1
and the C-terminus with α2, but it remains difficult to say
which β strands contribute to folding. Therefore, in this
study, we simply chose the β strands that have higher average
Φ values than the α helix with the
lower average Φ value of the two α helices.
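
The selection rule for an experimental folding segment can be sketched as follows. The element labels and the function names are hypothetical, and this is one reading of the rule stated above: the threshold is the lower of the two helix averages, and every secondary structure whose average Φ value exceeds it is selected.

```python
def average_phi(phi_by_element):
    """Average Φ value per secondary structure; elements without data are dropped."""
    return {name: sum(vals) / len(vals) for name, vals in phi_by_element.items() if vals}


def experimental_folding_segment(phi_by_element, helices=("a1", "a2")):
    """Secondary structures whose average Φ exceeds that of the lower-Φ helix.

    phi_by_element: dict mapping an element label (e.g. "b1", "a1") to the
    list of experimental Φ values measured within that element.
    """
    averages = average_phi(phi_by_element)
    threshold = min(averages[h] for h in helices)
    return sorted(name for name, value in averages.items() if value > threshold)
```

An element with no reported Φ values is simply left out of the comparison, which mirrors how a strand without data is omitted from the averages.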

Results 

Folding segments predicted by ADM analysis 

The predicted contact maps and the location of the
compact regions, namely, the predicted folding segments,
are shown in Figure 4 and summarized in Table 1.
According to these data, all four proteins have two compact
regions, and each region contains one α helix with
a couple of β strands.

For U1A spliceosomal protein [PDB: 1URN] and
muscle-type acylphosphatase 2 [PDB: 1APS], the N-terminal
compact region has a higher η value than that
of the C-terminus, which suggests that the N-terminal
region is stable compared to the C-terminal region during
the early stage of folding, while procarboxypeptidase
A2 [PDB: 1O6X] shows the opposite trend. As for ribosomal
protein S6 [PDB: 1RIS], the two compact regions
have similar η values, which is interpreted as meaning
that both of these regions play equally important roles in
structural formation. It is also notable that except for
the case of mtAcP, β3 is always included in the primary
folding segments (the predicted region with a higher η
value) for each protein.

We are interested in comparing the predicted folding
segments with the secondary structures whose average
Φ values are high. Figure 5a-d show the results. According
to these figures, the secondary structures within the
predicted folding segments often correspond to the secondary
structures with high average Φ values.

In Figure 5, the positions of the secondary structures
with higher average Φ values than those of the α helix
(taking the lower Φ value of the two α helices) are colored
red in the right panel, while in the left panel, the
positions of the predicted primary folding segment at
the N- or C-terminus are colored yellow or green, respectively.
For S6, however, we color both segments in
red or yellow/green, because according to the average Φ
values and η values of S6 (see Figure 3 and Table 1, respectively),
both the N-terminal and C-terminal folding segments
are equally important in the formation of its 3D
structure. Figures 3, 4, and 5 indicate that almost all the
important secondary structures for folding, as defined by
experimental Φ-value results, are included in the folding
segments predicted by the ADMs, although β3 in mtAcP,
which shows a relatively high Φ value, is not included in
the region predicted by the ADM.

Evolutionary conservation analysis with F-value analysis - 
Location of predicted hydrophobic clusters 

The solid line in Figure 6 indicates the F-value result, 
while the broken line shows the smoothed plots of the 
number of contacts with other conserved hydrophobic 
residues. (The conserved hydrophobic sites are shown in 
Additional file 1: Figure S8.) Smoothing was performed 
with a Gaussian kernel [66]. A conserved hydrophobic 
contact means that a pair of conserved hydrophobic resi- 
dues form a contact. The locations of the peaks are indi- 
cated by single or double daggers for F-value or conserved 
hydrophobic contacts, respectively. A number near a dagger
indicates the corresponding residue number in a
protein. We follow the PDB system concerning the residue
number in a protein. The secondary structures and
conserved hydrophobic residues are shown below the
plot.

Figure 4 Results of ADM analyses. (a) U1A, (b) ADA2h, (c) S6, and (d) mtAcP. The color bars on the diagonal of a predicted contact map 
indicate the location of secondary structures. The abscissa and ordinate denote residue numbers, and triangles with a solid line in red or black 
indicate the location of primary or auxiliary compact regions, respectively. A large triangle with a broken line means it is ignored because it 

Except for several sites, most of the conserved hydro- 
phobic residues are distributed somewhat sparsely but 
uniformly, which implies that it is hard to extract folding 
segments from only their amino acid sequences and 
conservation analyses. According to Figure 6, most of 
the F-value peaks are close to those of the smoothed line 
within ± 3 residues. This indicates that F-value peaks, 
which can be mainly regarded as hydrophobic clusters in 
the initial nucleation stage, also correspond to the region 
with many conserved hydrophobic contacts, which are 
important for the formation of a native structure. 
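
The comparison of peak positions within ± 3 residues can be sketched in a few lines. The kernel bandwidth, the normalized (weighted-average) form of the Gaussian smoothing, and all function names below are assumptions made for illustration; the source only states that a Gaussian kernel was used and that peaks were compared within ± 3 residues.

```python
import math


def gaussian_smooth(values, bandwidth=2.0):
    """Smooth a per-residue profile with a normalized Gaussian kernel.

    bandwidth (in residues) is an assumed parameter, not a value from the text.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(n)]
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return smoothed


def local_peaks(values):
    """Indices whose value is strictly greater than both neighbours.

    The first and last positions are never reported, in line with the
    caution about unreliable values at the termini.
    """
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


def matched_peaks(peaks_a, peaks_b, tolerance=3):
    """Peaks in peaks_a that have a peak in peaks_b within +/- tolerance residues."""
    return [p for p in peaks_a if any(abs(p - q) <= tolerance for q in peaks_b)]
```

Applied to the two profiles, `matched_peaks(f_value_peaks, contact_peaks)` would list the F-value peaks that lie close to a peak of the smoothed conserved hydrophobic contacts; both argument names are placeholders.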

Direct comparisons of these regions with high-Φ-value
sites are shown in Additional file 1: Figure S9. For
high-Φ-value sites, only the sites with a Φ value higher than
the average Φ value of the protein are shown along with
each site's residue type and number. The peaks of
smoothed conserved hydrophobic contacts are marked by
double daggers as in Figure 6. High-Φ-value sites are
found to exist near the peaks of the conserved hydrophobic
contacts (within ± 3 residues), suggesting that some of
these contacts are responsible for structural formation.
These high-Φ-value sites are also found to exist near the
F-value peaks within ± 3 residues, as shown in Additional
file 1: Figure S10.

Folding segments predicted by ADMs in the homologous 
proteins of the study proteins 

To confirm whether the folding segments are conserved 
among evolutionarily related proteins, we performed our 
sequence analysis on the homologues of the four study 
proteins. The results of applying ADM analyses to these 
homologues are shown in Figure 7.

Table 1 Summary of the Average Distance Map (ADM) analyses

PDB entry   N-termini   C-termini   η         Dominance
1URN        12          95          0.112^b
            12          59          0.110     N
            62          95          0.047
1O6X        4           29          0.066
            50          80          0.197     C
1RIS        6           91          0.115^b
            6           30          0.091     N
            52          91          0.087^a
            60          91          0.095     C
1APS        1           39          0.154     N
            1           47          0.132^a
            61          96          0.036

N or C denote the N- or C-terminal borders of compact regions. The primary
compact regions are shown with N or C in the Dominance column. In the η
column, compact regions extended by the 85% rule [31] are identified with a
superscript a; compact regions ignored due to covering more than 70% of the
entire sequence are identified with a superscript b.



Figure 7 shows the respective multiple alignments
of the homologues. The locations of the predicted folding
segments are colored dark gray; the brighter the color,
the higher the region's η value. It can be visually confirmed
in Figure 7a-c that there are several bands indicating
that for most of the homologues the same regions
are predicted.

In Figure 7, we ordered the sequences based on the 
similarity of the location of the regions predicted by the 
ADMs. In the right column, the phylogenetic tree based 
on an ADM similarity matrix and the neighbor-joining 
method is shown. Another result based on the sequence 
identity is shown in Additional file 1: Figure S7. It is dif- 
ficult to determine the relationship between the location 
of the folding segments and their evolutionary distance 
(as specified by the calculated sequence identity). How- 
ever, we can conclude that for U1A, ADA2h, and S6, the 
folding segments themselves are conserved among their 
homologues, while those of mtAcP are not. 

To represent these common folding segments, we cal- 
culate the percentage of residues that are members of 
the predicted folding segments for each site. The results 
are also shown as a histogram colored black in Figure 7. 
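
The per-site percentage behind such a histogram can be computed directly from the aligned predictions. In this sketch each homologue is represented by a list with one entry per alignment site: True or False for membership in a predicted folding segment, and None for a gap. Excluding gapped entries from the denominator is an assumption, as is the function name.

```python
def segment_coverage(masks):
    """Per-site percentage of homologues whose residue lies in a predicted segment.

    masks: one list per homologue, all of alignment length, holding True,
    False, or None (gap). Gapped entries are left out of the percentage.
    """
    coverage = []
    for column in zip(*masks):
        present = [m for m in column if m is not None]
        coverage.append(100.0 * sum(present) / len(present) if present else 0.0)
    return coverage
```

Sites where the value stays close to 100% across the family correspond to the bands that are visible in the alignment view.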

In the case of U1A, four secondary structures βαββ at
the N-terminus form one strong folding segment for
most of the homologues, while the other C-terminal region
comprises a weak folding segment. For ADA2h, the
C-terminal folding segment βαβ is conserved and
strong, and the N-terminal folding segment is conserved
but weak.

As for S6, there are many homologues, and they share
almost the same folding segments. One segment consists
of βαβ at the C-terminus and the other one consists of
βα at the N-terminus. The dominance of these two
folding segments at the termini often differs among the
homologues. It is also notable that for some homologous
proteins, the region from β2 to the hairpin loop
comprises the weakest folding segment, which forms a
β-hairpin with β3 in the C-terminal folding segment.

Finally, for mtAcP, the folding segments are not conserved 
among its homologous proteins. However, the 
locations of the folding segments appear similar to those 
of S6, ADA2h, and U1A. 

Discussion 

The ADM analyses of the four proteins predict two 
compact regions, each including one α helix and a couple 
of β strands, for each protein. These regions contain the 
secondary structures with high average Φ values (Figure 3). 
Therefore, we consider the predicted compact regions 
to correspond well to the folding initiation segments, as 
was the case for the other proteins we treated in previous 
studies, including lysozyme, leghemoglobin, fatty 
acid binding protein, azurin, and two ancient TIM-barrel 
proteins [33-36]. According to the η values, 
mtAcP and U1A have the primary predicted folding 
initiation segment at the N-terminus, whereas ADA2h has 
one at the C-terminus. On the other hand, the two folding 
segments of S6 have similar η values (see Figure 4). Figure 5 
shows good agreement between the ADM predictions and 
the experimental results; however, the resolution of this 
analysis is too low to predict the folding mechanisms. 

By means of F-value analysis, we increased the resolution 
of the prediction made by ADM, allowing us to 
identify the regions that would form hydrophobic 
clusters. According to Figure 6, almost all the peaks are 
located on the secondary structures or their edges, and 
the highest peak is located in the primary folding segment 
for each protein. For example, U1A has the primary 
folding segment at the N-terminus from β1 to β3, 
and its highest F-value peak is located in β3. In addition, 
each F-value peak lies within ±3 residues of a peak of 
the smoothed line of the conserved hydrophobic contacts, 
except for the broad peak observed at the C-terminus 
of ADA2h, which contains several minor 
peaks. These conserved contacts are thought to play 
important roles in the structural formation or stabilization 
of U1A's native structure [48-53]. Taking these facts 
into account, the conserved hydrophobic residues near 
the F-value peaks are considered to be significant for 
the folding initiation. The basis for predicting the folding 
mechanisms from only sequence information is the 
fact that the regions predicted by ADM analysis contain 
the high-Φ-value residues measured by experiments 
[15-17,42] and that the F-value analysis reflects the 
conserved hydrophobic contacts. Let us now make 
inferences regarding the folding mechanisms for proteins 
based on the results of the ADMs and F-value 
analyses. 

Figure 5 Comparison of predicted folding segments and experimental folding segments. (a) U1A, (b) ADA2h, (c) S6, and (d) mtAcP. In the 
left column, the predicted primary folding segments located at the N- or C-termini are colored orange or green, respectively. In the right column, 
all the secondary structures with an average Φ value higher than that of the α helix with the lower average Φ value are colored red. However, 
for S6, β-strand 3 and α-helix 1 are also colored in red, because their average Φ values are not significantly lower than the average Φ value of 
the α helix with the higher value, unlike the case in other proteins. 

Figure 6 Results of F-value analyses and the distribution of conserved hydrophobic contacts. (a) U1A, (b) ADA2h, (c) S6, and (d) mtAcP. 
F values or the smoothed number of conserved hydrophobic contacts are shown as a solid or broken line, respectively. The ordinate denotes the 
F value or the number of conserved hydrophobic contacts and the patterns along the abscissa (residue number) show the location of secondary structures. The 
conserved amino acid residues and the location of predicted folding segments are also shown below the plot. The F-value peaks that were the 
focus of this study are marked with single daggers (†), and the number above each dagger denotes the residue number of the respective peak. 
The smoothed number of conserved hydrophobic contacts is in arbitrary units, and the peak location is shown with a double dagger (‡) like the 
F-value peaks. Only for U1A, the shoulder is indicated with parentheses. 
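
The agreement between F-value peaks and conserved-contact peaks within ±3 residues, noted in this discussion, can be checked mechanically. The sketch below is only an illustration under stated assumptions: the helper names `local_peaks` and `peaks_within_tolerance` are hypothetical, a peak is taken to be any position strictly higher than both neighbors, and the input profiles are assumed to be already smoothed. The peak-picking and smoothing rules actually used in this study are not specified in this section.

```python
from typing import Dict, List, Sequence


def local_peaks(values: Sequence[float]) -> List[int]:
    """Indices whose value is strictly greater than both neighbors
    (an assumed, simple definition of a peak)."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def peaks_within_tolerance(
    f_peaks: Sequence[int], contact_peaks: Sequence[int], tolerance: int = 3
) -> Dict[int, bool]:
    """For each F-value peak, report whether some conserved-contact peak
    lies within `tolerance` residues of it."""
    return {
        p: any(abs(p - q) <= tolerance for q in contact_peaks)
        for p in f_peaks
    }


# Toy profiles (not data from this study); indices stand in for residue numbers.
f_values = [0, 2, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0]
contacts = [0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]

f_peaks = local_peaks(f_values)        # [1, 6, 12]
contact_peaks = local_peaks(contacts)  # [2, 8]
print(peaks_within_tolerance(f_peaks, contact_peaks))
# {1: True, 6: True, 12: False}
```

A broad peak made of several minor peaks, as described for the C-terminus of ADA2h, would need a different treatment (for example, comparing intervals rather than single positions).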

U1A spliceosomal protein; U1A 

As shown in Figure 6a, the primary compact region of 
U1A covers β1, α1, β2, and β3, and each of α1, 
β2, and β3 contains just one F-value peak. The auxiliary 
compact region of U1A ranges from α2 to α3. Because 
the auxiliary compact region has a lower η value, it is 
thought to participate in the structural formation after 
the primary compact region has been formed. 
Ternstrom et al. [16] suggest that the region from β1 to 
β3 is more structured than α2 and β4 [16,67], so 
our results agree well with their experimental Φ-value 
analysis. Figure 8a presents the packing formed by 
conserved hydrophobic residues near the F-value peaks. 
The residues that contribute to the hydrophobic packing 
are represented in the CPK model in this figure. The 
regions colored yellow or green correspond to the predicted 
N- or C-terminal compact regions, respectively. 

Procarboxypeptidase A2; ADA2h 

The auxiliary compact region of ADA2h at the N-terminus 
has two secondary structures, β1 and α1, whereas the 
primary compact region at the C-terminus has three 
secondary structures, β3, α2, and β4. The former has a 
high F-value peak around α1, indicating that α1 is the 
center of folding within the auxiliary compact region. On the 
other hand, the primary compact region has the highest 
and broadest peak from β3 to α2. Therefore, we can 
predict that after β3 and α2 form a folding segment, β1 and 
α1 pack with this segment and stabilize it. 

This prediction can also be validated by Figure 3b. Villegas 
et al. [17] state that the folding segment in this protein 
consists of α2 and its surrounding β strands. This 
agrees with our result. However, we could not confirm 
any packing between the conserved hydrophobic resi- 
dues near the F-value peaks in Figure 8b within the pri- 
mary compact region in the native structure as observed 
in U1A. This is because the region with the largest broad 
F-value peak in the C-terminal region seems to have only 
a few conserved hydrophobic residues as indicated by the 
smoothed plot of the conserved contacts which shows 
several minor peaks here. In this case, the resolution of 
the F-value line is too low to detect the residues important 
for folding. 

Ribosomal protein S6; S6 

The relatively auxiliary folding segment of S6 at the 
N-terminus contains β1 and α1, while the primary folding 
segment at the C-terminus contains β3, α2, and β4. The 
η values are quite similar, so we cannot say which region 
folds more dominantly. S6 has two significant F-value 
peaks within the predicted folding segments: one is 
around α1 at the N-terminus, and the other is around 
β3 at the C-terminus. Lindberg et al. [67] suggest that 
the primary folding segment consists of β1, α1, and β3 
in the early folding stage, and our results reflect this. 
Figure 8c shows the residue packing near the F-value 
peaks inside the predicted folding segments. 

Figure 7 Results of applying ADM analysis to the homologues 
of the study proteins. (a) U1A, (b) ADA2h, (c) S6, and (d) mtAcP. 
Each line corresponds to one homologue, and the dark gray colored 
regions are the compact regions predicted by the ADMs. The lighter 
the color is, the higher its η value. These homologues are sorted by 
the ADM similarity. On the right side, the tree made from the ADM 
similarity matrix by means of the neighbor-joining method [56] is 
shown. The location of the secondary structures is shown above the 
results of the ADM analyses (arrows for β strands and helices for 
α helices), and the ratio of being included in ADMs for each site is 
shown below them as a histogram. The abscissa is the residue number. 

It is also notable that in the case of S6, there is a highly 
frustrated region between the C-terminal unstructured 
coil and the β sheet based on the structure of S6 [68]. 
However, the corresponding C-terminal region does not 
have any specific contact with other regions in the ADM. 
This is confirmed by its NMR structure ([PDB: 2KJV]). At 
least as far as the ADM result and the NMR structure are 
concerned, the frustration between the C-terminal 
unstructured coil and the β sheet does not seem strong. 

Muscle-type acylphosphatase 2; mtAcP 

The primary folding segment of this protein, which has a 
significantly higher η value than the other folding segment, 
is located at the N-terminus. This result agrees with 
that of Selvaraj et al. [69], who suggested the 
existence of a hydrophobic cluster surrounding α1 based 
on the distribution of contacts. The primary segment 
contains β1, α1, and β2, whereas the auxiliary segment 
contains α2, β4, and β5. There are four F-value peaks near α1, 
β2, β3, and α2; three of them are located in the predicted 
folding segments, but the peak near β3 is not. (In fact, β3 
belongs to the primary folding segment and seems to play 
a critical role in the folding in other proteins.) Therefore, 
we propose that after α1 and β2 play a role in early 
structural formation, α2 participates in the last structural 
formation, followed by contributions from β3. This scenario 
does not seem to fit the results of the average Φ-value 
analysis [42], which indicate that β3 plays a more important 
role than β2 (see Figure 2d). 

According to Parrini et al. [70], when β3 is forced to join 
the folding process by a disulfide bond between β1 and 
β3, the folding rate is improved dramatically. This result 
suggests that the participation of β3 in the folding process 
is rate limiting and may reflect the present findings. For 
this reason, we do not consider the two analyses to 
conflict with each other. The inter-segment packing between 
the conserved residues is represented in Figure 8d. 

Meanwhile, it should be noted that in their analysis, 
Chiti et al. [42] ignore the highest Φ value of the 23rd 
residue located in α1 and then compare the result with 
that of ADA2h. In this case, the average Φ value of α2 is 
higher than that of α1, indicating that the more structured 
secondary structure in its transition state is α2, the 
same as for ADA2h. There are also several studies that 
suggest that α2 in mtAcP is more important than α1 
[71-74]. Interestingly, one of the studies refers to the 
large effect on α1 induced by point mutation. Taddei et al. 
[71] consider the Φ values of α1 to be unreliable because, 
according to their experiment, inducing a point mutation 
on α1 makes mtAcP form amyloid fibrils. Thus, interpreting 
the folding segment of mtAcP is difficult. 

Our previous simplified Go-model simulations reveal 
that the interactions between the folding segments in 
the present definition are significant in the formation of 
transition state ensembles [75]. 

According to the discussion above, the conserved 
hydrophobic residues among homologues are distributed 
near F-value peaks (Figure 6), and they seem to be 
involved in folding. Yet, as we mentioned in the introduction 
section, some studies have shown 
that several sequences that fold into the same 3D structure 
have the same folding process, while other studies 
have shown that two evolutionarily related sequences 
with the same 3D structure could have different folding 
processes. In this sense, whether all the homologues of 
our study proteins really have the same folding segments 
is an interesting question. 

Figure 8 Hydrophobic interactions observed in the ferredoxin-like fold proteins. (A) Representation of internal conserved hydrophobic 
contacts. N- and C-terminal compact regions are colored yellow and green, respectively. The conserved hydrophobic residues near the F-value 
peaks have a space-filling representation. (B) Illustrations of important interactions among the secondary structures discussed in the current study. 
α helices are shown as circles, and β strands are shown as triangles. The N- or C-terminal compact regions are colored in yellow or green, 
respectively, as in (A). The important interactions are indicated in red. 

In the present study, we aim to analyze the conserva- 
tion of the predicted folding segments in the homo- 
logues of a protein by applying ADM analysis to them. 
The predicted folding segments are highly conserved 
among the highly homologous proteins (with roughly 
50% sequence identity on average) except for mtAcP and 
its homologous proteins, as shown in Figures 7 and 9. In 
Figure 7, the conservation of the predicted folding seg- 
ments is summarized as a histogram below each result. 
As seen in this figure, the histogram of mtAcP is uneven 
compared to the other histograms. Figure 9 depicts this 
property from another aspect. 

The abscissa in Figure 9 represents the lower limit of 
the sequence identity for calculating the average ADM 
similarity; that is, an average ADM similarity value is 
calculated using homologues with a sequence identity 
above the lower limit. The ordinate denotes the 
average ADM similarity. The double line, solid line, thick 
line, and dotted line correspond to U1A, ADA2h, S6, and 
mtAcP, respectively. While the other three proteins maintain 
a folding segment similarity of more than 75% 
even when the sequence identity decreases, only mtAcP 
loses similarity down to 62%. This result suggests a diversity 
of folding processes in mtAcP compared to those in 
the other proteins, especially when sequence identity is 
low. On the other hand, as we mentioned in the results 
section, the relationship between ADM similarity and 
sequence identity is not parallel (see Additional file 1: 
Figure S7). This is an unexpected result: we expected 
that the higher the sequence identity is, the more 
similar the protein folding segments are. Yet the present 
results suggest that a property related to folding segments 
is conserved more than sequence identity. 
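
The thresholded average plotted in Figure 9 can be sketched as follows. This is a hedged illustration, not the procedure used in the study: it assumes that each homologue is described by a pair (sequence identity, ADM similarity), both measured against the study protein and expressed as fractions between 0 and 1, and the function name `average_similarity_above` is hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed input: one (sequence_identity, adm_similarity) pair per homologue.
Pair = Tuple[float, float]


def average_similarity_above(pairs: List[Pair], lower_limit: float) -> Optional[float]:
    """Average ADM similarity over homologues whose sequence identity
    exceeds `lower_limit`; None if no homologue qualifies."""
    selected = [similarity for identity, similarity in pairs if identity > lower_limit]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Toy values (not data from this study).
pairs = [(0.9, 0.95), (0.7, 0.85), (0.5, 0.80), (0.3, 0.60)]
for limit in (0.2, 0.4, 0.8):
    print(limit, average_similarity_above(pairs, limit))
```

Sweeping `lower_limit` from high to low adds progressively more distant homologues to the average, so a curve that stays flat indicates folding segments that remain similar even as sequence identity drops.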

Summarizing the discussion above, there are mainly 
two situations. One of them clearly comprises a main 
large folding segment around α1, as in U1A. The other 
situation comprises complex folding segments in which 
one of the α helices and its surrounding β strands play a 
key role at first, immediately followed by the other helix 
and its surrounding strands, as in ADA2h or S6. The 
homologous proteins of mtAcP have either property: 
some of them have folding segments similar to those of 
U1A proteins, and some others have folding segments 
similar to those of S6 or ADA2h proteins. This implies 
that mtAcP and its homologous proteins do not seem to 
have any common or rigid folding segments. 

Figure 9 The relationship between the similarity of the predicted 
folding segments and sequence identity. The abscissa denotes the 
lower limit of sequence identity and the ordinate denotes the average 
similarity of the predicted folding segments. The double line, solid line, 
thick line, and dotted line correspond to U1A, ADA2h, S6, and mtAcP, 
respectively. 

Conclusions 

The secondary structures that are thought to play important 
roles in folding as revealed by their average Φ values 
correspond to the folding segments predicted by ADM 
analyses at least for the proteins treated in this study, as 
was the case in our previous studies [33-36,47,76]. There 
are two predicted folding segments at the termini of each 
protein; however, which segment is primary is completely 
determined on a case-by-case basis. This tendency was 
also in good agreement with the experimental results for 
the present four study proteins. Some of the conserved 
hydrophobic contacts considered to play important roles 
in structural formation [49,53] are located near the F- 
value peaks. Therefore, we can predict the folding mecha- 
nisms by extracting the conserved hydrophobic residues 
near them. For the four proteins we studied above, we 
conclude that we succeeded in predicting their folding 
mechanisms correctly from only their sequences. 

According to the ADM results of the homologues, their 
folding segments seem to be conserved, especially when 
the sequence identity is above 80%. Below this level, only 
mtAcP shows a diversity of folding segments, whereas 
the other three proteins show high conservation. 

Our findings suggest that it should be possible to pre- 
dict the folding mechanisms or properties of many other 
kinds of proteins from only the amino acid sequences by 
means of our ADM analysis and F-value analysis. 

Additional file 



Additional file 1: Details of the ADM analysis and optional results 
are provided. 